Employee promotion means the advancement of an employee to a higher rank. It is one of the strongest motivators on the job and the ultimate reward for dedication and loyalty towards an organization, and the HR team plays an important role in handling promotion decisions based on ratings and the other attributes available.
The HR team at JMD company stored data from last year's promotion cycle, covering every employee working in the company that year along with whether or not they were promoted. Every cycle the process gets delayed because so many details are available for each employee that comparing candidates and deciding becomes difficult.
This time, the HR team wants to use the stored data to build a model that predicts whether a person is eligible for promotion.
As a data scientist at JMD company, you need to come up with a model that will help the HR team predict whether a person is eligible for promotion.
Explore and visualize the dataset. Build a classification model to predict whether an employee has a high probability of being promoted. Optimize the model using appropriate techniques. Generate a set of insights and recommendations that will help the company.
There are two parts to the submission:
Submission will not be evaluated if:
Happy Learning!!
Perform an Exploratory Data Analysis on the data. Points: 6
Illustrate the insights based on EDA. Points: 5
Data Pre-processing. Points: 6
Model building - Logistic Regression. Points: 6
Model building - Bagging and Boosting. Points: 9
Hyperparameter tuning using grid search. Points: 9
Hyperparameter tuning using random search. Points: 9
Model Performances. Points: 5
Actionable Insights & Recommendations. Points: 5
Total Points: 60
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report
df = pd.read_csv('employee_promotion.csv')
version_dict = {0: 'df'}
df.head()
| employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | is_promoted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49.0 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60.0 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50.0 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 50.0 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 73.0 | 0 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   employee_id           54808 non-null  int64  
 1   department            54808 non-null  object 
 2   region                54808 non-null  object 
 3   education             52399 non-null  object 
 4   gender                54808 non-null  object 
 5   recruitment_channel   54808 non-null  object 
 6   no_of_trainings       54808 non-null  int64  
 7   age                   54808 non-null  int64  
 8   previous_year_rating  50684 non-null  float64
 9   length_of_service     54808 non-null  int64  
 10  awards_won            54808 non-null  int64  
 11  avg_training_score    52248 non-null  float64
 12  is_promoted           54808 non-null  int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 5.4+ MB
unique = df.nunique() # get number of unique values
null = df.isnull().sum() # get number of null values
count = df.count() # get total count
percent_null = (null/count*100).round(2) # calculate percent of null values
df_uncp = pd.concat([unique, null, count, percent_null], axis=1) # concatenate all variables above
df_uncp.columns = ['unique', 'null', 'count', 'percent_null'] # name columns
df_uncp # df unique, null, count, percent
| unique | null | count | percent_null | |
|---|---|---|---|---|
| employee_id | 54808 | 0 | 54808 | 0.00 |
| department | 9 | 0 | 54808 | 0.00 |
| region | 34 | 0 | 54808 | 0.00 |
| education | 3 | 2409 | 52399 | 4.60 |
| gender | 2 | 0 | 54808 | 0.00 |
| recruitment_channel | 3 | 0 | 54808 | 0.00 |
| no_of_trainings | 10 | 0 | 54808 | 0.00 |
| age | 41 | 0 | 54808 | 0.00 |
| previous_year_rating | 5 | 4124 | 50684 | 8.14 |
| length_of_service | 35 | 0 | 54808 | 0.00 |
| awards_won | 2 | 0 | 54808 | 0.00 |
| avg_training_score | 59 | 2560 | 52248 | 4.90 |
| is_promoted | 2 | 0 | 54808 | 0.00 |
print('dupes: ', df.duplicated().sum())
dupes: 0
print('sum of per-column percent null: ', df_uncp['percent_null'].sum())
sum of per-column percent null:  17.64
# get a snapshot of all rows where education is null
df[df.education.isna()]
| employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | is_promoted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 29934 | Technology | region_23 | NaN | m | sourcing | 1 | 30 | NaN | 1 | 0 | 77.0 | 0 |
| 21 | 33332 | Operations | region_15 | NaN | m | sourcing | 1 | 41 | 4.0 | 11 | 0 | 57.0 | 0 |
| 32 | 35465 | Sales & Marketing | region_7 | NaN | f | sourcing | 1 | 24 | 1.0 | 2 | 0 | 48.0 | 0 |
| 43 | 17423 | Sales & Marketing | region_2 | NaN | m | other | 3 | 24 | 2.0 | 2 | 0 | 48.0 | 0 |
| 82 | 66013 | Sales & Marketing | region_2 | NaN | m | sourcing | 2 | 25 | 3.0 | 2 | 0 | 53.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54692 | 14821 | Sales & Marketing | region_2 | NaN | f | sourcing | 1 | 35 | 3.0 | 7 | 0 | 53.0 | 0 |
| 54717 | 7684 | Analytics | region_2 | NaN | m | sourcing | 1 | 32 | 3.0 | 4 | 0 | 86.0 | 0 |
| 54729 | 1797 | HR | region_2 | NaN | f | other | 1 | 28 | 3.0 | 2 | 0 | 53.0 | 0 |
| 54742 | 38935 | Sales & Marketing | region_31 | NaN | m | other | 1 | 28 | 4.0 | 3 | 0 | 47.0 | 0 |
| 54806 | 13614 | Sales & Marketing | region_9 | NaN | m | sourcing | 1 | 29 | 1.0 | 2 | 0 | NaN | 0 |
2409 rows × 13 columns
# for each selected column, compare the distribution of rows with missing education
# against the three education levels
edu_nan_cols = ['age', 'length_of_service', 'no_of_trainings', 'avg_training_score', 'previous_year_rating', 'is_promoted']
for col in edu_nan_cols:
    sns.displot(df[col][df.education.isna()])
    sns.displot(df[col][df.education == 'Below Secondary'])
    sns.displot(df[col][df.education == 'Bachelor\'s'])
    sns.displot(df[col][df.education == 'Master\'s & above'])
/Users/ivansaucedo/opt/anaconda3/envs/aiml/lib/python3.8/site-packages/seaborn/axisgrid.py:392: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
  fig, axes = plt.subplots(nrow, ncol, **kwargs)
# drop null values in education column & create new revision; df drop na v01
df_dna = df.copy()
version_dict[1] = 'df_dna'
df_dna.dropna(subset=['education'], inplace=True)
df_dna
| employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | is_promoted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49.0 | 0 |
| 1 | 65141 | Operations | region_22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60.0 | 0 |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50.0 | 0 |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 50.0 | 0 |
| 4 | 48945 | Technology | region_26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 73.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54802 | 6915 | Sales & Marketing | region_14 | Bachelor's | m | other | 2 | 31 | 1.0 | 2 | 0 | 49.0 | 0 |
| 54803 | 3030 | Technology | region_14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 78.0 | 0 |
| 54804 | 74592 | Operations | region_27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 56.0 | 0 |
| 54805 | 13918 | Analytics | region_1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 0 | 79.0 | 0 |
| 54807 | 51526 | HR | region_22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 49.0 | 0 |
52399 rows × 13 columns
# replace null values in the previous_year_rating column with the column mean
df_dna['previous_year_rating'] = df_dna['previous_year_rating'].fillna(df_dna['previous_year_rating'].mean())
# replace null values in the avg_training_score column with the column mean
df_dna['avg_training_score'] = df_dna['avg_training_score'].fillna(df_dna['avg_training_score'].mean())
# confirm no null values
df_dna.isnull().sum()
employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
age                     0
previous_year_rating    0
length_of_service       0
awards_won              0
avg_training_score      0
is_promoted             0
dtype: int64
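The same mean imputation can be done with scikit-learn's `SimpleImputer`, which behaves like `fillna(mean)` but can later be dropped into a `Pipeline` so the training-set means are reused on unseen data. A sketch on a hypothetical mini-frame standing in for the two columns with nulls:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# hypothetical mini-frame standing in for the two numeric columns with nulls
toy = pd.DataFrame({'previous_year_rating': [5.0, np.nan, 3.0, 1.0],
                    'avg_training_score':   [49.0, 60.0, np.nan, 50.0]})

# mean imputation, equivalent to fillna(mean) but reusable inside an sklearn Pipeline
imputer = SimpleImputer(strategy='mean')
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
```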
# extract only the number from the region string using regex & create new revision: df transform categorical v02
df_tcat = df_dna.copy()
version_dict[2] = 'df_tcat'
df_tcat.region = df_tcat.region.str.extract(r'(\d+)')
df_tcat
| employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | is_promoted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | 7 | Master's & above | f | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49.0 | 0 |
| 1 | 65141 | Operations | 22 | Bachelor's | m | other | 1 | 30 | 5.0 | 4 | 0 | 60.0 | 0 |
| 2 | 7513 | Sales & Marketing | 19 | Bachelor's | m | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50.0 | 0 |
| 3 | 2542 | Sales & Marketing | 23 | Bachelor's | m | other | 2 | 39 | 1.0 | 10 | 0 | 50.0 | 0 |
| 4 | 48945 | Technology | 26 | Bachelor's | m | other | 1 | 45 | 3.0 | 2 | 0 | 73.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54802 | 6915 | Sales & Marketing | 14 | Bachelor's | m | other | 2 | 31 | 1.0 | 2 | 0 | 49.0 | 0 |
| 54803 | 3030 | Technology | 14 | Bachelor's | m | sourcing | 1 | 48 | 3.0 | 17 | 0 | 78.0 | 0 |
| 54804 | 74592 | Operations | 27 | Master's & above | f | other | 1 | 37 | 2.0 | 6 | 0 | 56.0 | 0 |
| 54805 | 13918 | Analytics | 1 | Bachelor's | m | other | 1 | 27 | 5.0 | 3 | 0 | 79.0 | 0 |
| 54807 | 51526 | HR | 22 | Bachelor's | m | other | 1 | 27 | 1.0 | 5 | 0 | 49.0 | 0 |
52399 rows × 13 columns
# transform gender into binary
gender = {'m': 1,'f': 0}
df_tcat.gender = [gender[item] for item in df_tcat.gender]
df_tcat
| employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | is_promoted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | 7 | Master's & above | 0 | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49.0 | 0 |
| 1 | 65141 | Operations | 22 | Bachelor's | 1 | other | 1 | 30 | 5.0 | 4 | 0 | 60.0 | 0 |
| 2 | 7513 | Sales & Marketing | 19 | Bachelor's | 1 | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50.0 | 0 |
| 3 | 2542 | Sales & Marketing | 23 | Bachelor's | 1 | other | 2 | 39 | 1.0 | 10 | 0 | 50.0 | 0 |
| 4 | 48945 | Technology | 26 | Bachelor's | 1 | other | 1 | 45 | 3.0 | 2 | 0 | 73.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54802 | 6915 | Sales & Marketing | 14 | Bachelor's | 1 | other | 2 | 31 | 1.0 | 2 | 0 | 49.0 | 0 |
| 54803 | 3030 | Technology | 14 | Bachelor's | 1 | sourcing | 1 | 48 | 3.0 | 17 | 0 | 78.0 | 0 |
| 54804 | 74592 | Operations | 27 | Master's & above | 0 | other | 1 | 37 | 2.0 | 6 | 0 | 56.0 | 0 |
| 54805 | 13918 | Analytics | 1 | Bachelor's | 1 | other | 1 | 27 | 5.0 | 3 | 0 | 79.0 | 0 |
| 54807 | 51526 | HR | 22 | Bachelor's | 1 | other | 1 | 27 | 1.0 | 5 | 0 | 49.0 | 0 |
52399 rows × 13 columns
# get the unique values in the education column
df_tcat.education.unique()
array(["Master's & above", "Bachelor's", 'Below Secondary'], dtype=object)
# convert education into sequential tier/rank order
edu = {'Below Secondary': 1,'Bachelor\'s': 2, 'Master\'s & above':3}
df_tcat.education = [edu[item] for item in df_tcat.education]
df_tcat.head()
| employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | is_promoted | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | 7 | 3 | 0 | sourcing | 1 | 35 | 5.0 | 8 | 0 | 49.0 | 0 |
| 1 | 65141 | Operations | 22 | 2 | 1 | other | 1 | 30 | 5.0 | 4 | 0 | 60.0 | 0 |
| 2 | 7513 | Sales & Marketing | 19 | 2 | 1 | sourcing | 1 | 34 | 3.0 | 7 | 0 | 50.0 | 0 |
| 3 | 2542 | Sales & Marketing | 23 | 2 | 1 | other | 2 | 39 | 1.0 | 10 | 0 | 50.0 | 0 |
| 4 | 48945 | Technology | 26 | 2 | 1 | other | 1 | 45 | 3.0 | 2 | 0 | 73.0 | 0 |
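The dict-lookup list comprehensions used for `gender` and `education` raise a `KeyError` as soon as an unseen label appears. `Series.map` is the idiomatic alternative: unmapped labels become NaN, so gaps surface as nulls instead of crashing the cell. A sketch (the `'PhD'` label is hypothetical, added only to show the behavior):

```python
import numpy as np
import pandas as pd

edu = {'Below Secondary': 1, 'Bachelor\'s': 2, 'Master\'s & above': 3}
s = pd.Series(['Master\'s & above', 'Bachelor\'s', 'Below Secondary', 'PhD'])

# unmapped labels (here the hypothetical 'PhD') become NaN instead of raising KeyError
ranked = s.map(edu)
```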
# get the unique values in the recruitment_channel column
df_tcat.recruitment_channel.unique()
array(['sourcing', 'other', 'referred'], dtype=object)
# one-hot encoding of recruitment_channel & create new revision: df one hot encoding v03
df_ohe = pd.get_dummies(df_tcat, columns=['recruitment_channel'])
version_dict[3] = 'df_ohe'
df_ohe.head()
| employee_id | department | region | education | gender | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | is_promoted | recruitment_channel_other | recruitment_channel_referred | recruitment_channel_sourcing | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | 7 | 3 | 0 | 1 | 35 | 5.0 | 8 | 0 | 49.0 | 0 | 0 | 0 | 1 |
| 1 | 65141 | Operations | 22 | 2 | 1 | 1 | 30 | 5.0 | 4 | 0 | 60.0 | 0 | 1 | 0 | 0 |
| 2 | 7513 | Sales & Marketing | 19 | 2 | 1 | 1 | 34 | 3.0 | 7 | 0 | 50.0 | 0 | 0 | 0 | 1 |
| 3 | 2542 | Sales & Marketing | 23 | 2 | 1 | 2 | 39 | 1.0 | 10 | 0 | 50.0 | 0 | 1 | 0 | 0 |
| 4 | 48945 | Technology | 26 | 2 | 1 | 1 | 45 | 3.0 | 2 | 0 | 73.0 | 0 | 1 | 0 | 0 |
# get the unique values in the department column
df_ohe.department.unique()
array(['Sales & Marketing', 'Operations', 'Technology', 'Analytics',
'R&D', 'Procurement', 'Finance', 'HR', 'Legal'], dtype=object)
# one-hot encoding of department
df_ohe = pd.get_dummies(df_ohe, columns=['department'])
df_ohe.head()
| employee_id | region | education | gender | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | ... | recruitment_channel_sourcing | department_Analytics | department_Finance | department_HR | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | 7 | 3 | 0 | 1 | 35 | 5.0 | 8 | 0 | 49.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 65141 | 22 | 2 | 1 | 1 | 30 | 5.0 | 4 | 0 | 60.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 7513 | 19 | 2 | 1 | 1 | 34 | 3.0 | 7 | 0 | 50.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 2542 | 23 | 2 | 1 | 2 | 39 | 1.0 | 10 | 0 | 50.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 48945 | 26 | 2 | 1 | 1 | 45 | 3.0 | 2 | 0 | 73.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
5 rows × 23 columns
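One caveat with full one-hot encoding: the dummies for a column always sum to 1 (e.g. the three `recruitment_channel_*` columns), which makes them perfectly collinear and can destabilize logistic regression coefficients. A hedged sketch of the usual remedy, `drop_first=True`, on a toy frame (tree ensembles are largely indifferent to this, so keeping all dummies is also defensible):

```python
import pandas as pd

# toy frame standing in for the recruitment_channel column
toy = pd.DataFrame({'recruitment_channel': ['sourcing', 'other', 'referred', 'other']})

# drop_first removes the alphabetically first category ('other'),
# which becomes the implicit baseline encoded as all zeros
dummies = pd.get_dummies(toy, columns=['recruitment_channel'], drop_first=True)
```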
df_ohe.dtypes
employee_id                       int64
region                           object
education                         int64
gender                            int64
no_of_trainings                   int64
age                               int64
previous_year_rating            float64
length_of_service                 int64
awards_won                        int64
avg_training_score              float64
is_promoted                       int64
recruitment_channel_other         uint8
recruitment_channel_referred      uint8
recruitment_channel_sourcing      uint8
department_Analytics              uint8
department_Finance                uint8
department_HR                     uint8
department_Legal                  uint8
department_Operations             uint8
department_Procurement            uint8
department_R&D                    uint8
department_Sales & Marketing      uint8
department_Technology             uint8
dtype: object
# transform region column into int64 type
df_ohe.region = df_ohe.region.astype('int64')
df_ohe.dtypes
employee_id                       int64
region                            int64
education                         int64
gender                            int64
no_of_trainings                   int64
age                               int64
previous_year_rating            float64
length_of_service                 int64
awards_won                        int64
avg_training_score              float64
is_promoted                       int64
recruitment_channel_other         uint8
recruitment_channel_referred      uint8
recruitment_channel_sourcing      uint8
department_Analytics              uint8
department_Finance                uint8
department_HR                     uint8
department_Legal                  uint8
department_Operations             uint8
department_Procurement            uint8
department_R&D                    uint8
department_Sales & Marketing      uint8
department_Technology             uint8
dtype: object
# transform all uint8-type columns with int64
for col in df_ohe.columns:
if df_ohe[col].dtype == 'uint8':
# print(col)
df_ohe[col] = df_ohe[col].astype('int64')
df_ohe.dtypes
employee_id                       int64
region                            int64
education                         int64
gender                            int64
no_of_trainings                   int64
age                               int64
previous_year_rating            float64
length_of_service                 int64
awards_won                        int64
avg_training_score              float64
is_promoted                       int64
recruitment_channel_other         int64
recruitment_channel_referred      int64
recruitment_channel_sourcing      int64
department_Analytics              int64
department_Finance                int64
department_HR                     int64
department_Legal                  int64
department_Operations             int64
department_Procurement            int64
department_R&D                    int64
department_Sales & Marketing      int64
department_Technology             int64
dtype: object
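The dtype check inside the loop can be replaced by `select_dtypes`, which pulls all `uint8` columns at once and lets the cast happen in a single vectorized assignment. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': np.array([1, 0], dtype='uint8'),
                    'b': np.array([0, 1], dtype='uint8'),
                    'c': [1.5, 2.5]})

# cast every uint8 column in one shot instead of looping over df.columns
uint8_cols = toy.select_dtypes('uint8').columns
toy[uint8_cols] = toy[uint8_cols].astype('int64')
```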
df_describe = pd.concat([df_ohe.describe().T, df_ohe.skew().round(3)], axis=1) # concatenate df_ohe.describe and df_ohe.skew, round skew to 3 decimal places
df_describe['mean'] = df_describe['mean'].round(1) # round mean to 1 decimal place
df_describe['std'] = df_describe['std'].round(2) # round std to 2 decimal places
df_describe.rename(columns = {'50%':'median'}, inplace = True) # rename 50%
df_describe.rename(columns = {0:'skew'}, inplace = True) # rename skew
df_describe = df_describe.loc[:, ['mean', 'median', 'skew', 'std', 'min', 'max']] # recreate dataframe only with selected values
df_describe.iloc[4:11,:] # print only selected columns
| mean | median | skew | std | min | max | |
|---|---|---|---|---|---|---|
| no_of_trainings | 1.3 | 1.0 | 3.436 | 0.61 | 1.0 | 10.0 |
| age | 35.0 | 33.0 | 1.014 | 7.62 | 20.0 | 60.0 |
| previous_year_rating | 3.3 | 3.0 | -0.327 | 1.21 | 1.0 | 5.0 |
| length_of_service | 5.9 | 5.0 | 1.728 | 4.28 | 1.0 | 37.0 |
| awards_won | 0.0 | 0.0 | 6.339 | 0.15 | 0.0 | 1.0 |
| avg_training_score | 64.0 | 62.0 | 0.404 | 13.13 | 39.0 | 99.0 |
| is_promoted | 0.1 | 0.0 | 2.936 | 0.28 | 0.0 | 1.0 |
for col in df_ohe.columns[4:11]: # iterate through each column in the list of columns: df_ohe.columns[4:11]
sns.displot(df_ohe[col], kind="kde") # print displot of of selected columns
# view quantiles from 0 to 100% for selected columns
df_quantiles = df_ohe.describe(percentiles=np.arange(0.1, 1, 0.1)).T
df_quantiles = df_quantiles.iloc[1:,3:15]
df_quantiles.iloc[3:10,:]
| min | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| no_of_trainings | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.000000 | 1.0 | 1.0 | 2.0 | 10.0 |
| age | 20.0 | 27.0 | 29.0 | 30.0 | 32.0 | 33.0 | 35.000000 | 37.0 | 40.0 | 46.0 | 60.0 |
| previous_year_rating | 1.0 | 1.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.337526 | 4.0 | 5.0 | 5.0 | 5.0 |
| length_of_service | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 | 5.0 | 6.000000 | 7.0 | 8.0 | 11.0 | 37.0 |
| awards_won | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.0 |
| avg_training_score | 39.0 | 48.0 | 51.0 | 54.0 | 58.0 | 62.0 | 65.000000 | 71.0 | 79.0 | 83.0 | 99.0 |
| is_promoted | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.0 |
log_cols = ['no_of_trainings', 'age', 'length_of_service', 'avg_training_score'] # list of columns to take logarithm
for col in log_cols:
sns.displot(df_ohe[col], kind="kde") # print original displot
sns.displot(np.log(df_ohe[col]), kind="kde") # print displot after taking logarithm of columns
# compare skew before and after applying logarithm
skew = [[df_ohe.no_of_trainings.skew(), np.log(df_ohe.no_of_trainings).skew()]]
skew.append([df_ohe.age.skew(), np.log(df_ohe.age).skew()])
skew.append([df_ohe.length_of_service.skew(), np.log(df_ohe.length_of_service).skew()])
skew.append([df_ohe.avg_training_score.skew(), np.log(df_ohe.avg_training_score).skew()])
df_skew = pd.DataFrame(skew, columns=['skew', 'log skew'], index=['no_of_trainings', 'age', 'length_of_service','avg_training_score'])
df_skew
| skew | log skew | |
|---|---|---|
| no_of_trainings | 3.435561 | 2.034291 |
| age | 1.013941 | 0.499550 |
| length_of_service | 1.727902 | -0.302036 |
| avg_training_score | 0.403527 | 0.165607 |
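A hedged alternative worth considering before applying `np.log`: `np.log1p` (log of 1 + x) also tames right skew, but stays strictly positive for values >= 1, whereas `log(1) == 0` turns the minimum of `length_of_service` into a zero that later shows up as a zero denominator in ratio features. A sketch on a hypothetical right-skewed sample:

```python
import numpy as np
import pandas as pd

# right-skewed toy sample standing in for length_of_service (values are hypothetical)
service = pd.Series([1, 1, 2, 2, 3, 4, 5, 8, 15, 37], dtype='float64')

log_s = np.log(service)      # log(1) == 0, which later turns ratio denominators into zero
log1p_s = np.log1p(service)  # log(1 + x) stays strictly positive for x >= 1
```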
# iterate through log_cols defined above and apply logarithm
for col in log_cols:
df_ohe[col] = np.log(df_ohe[col])
# confirm skew was transformed
df_ohe.skew()[log_cols]
no_of_trainings       2.034291
age                   0.499550
length_of_service    -0.302036
avg_training_score    0.165607
dtype: float64
df_ohe.columns
Index(['employee_id', 'region', 'education', 'gender', 'no_of_trainings',
'age', 'previous_year_rating', 'length_of_service', 'awards_won',
'avg_training_score', 'is_promoted', 'recruitment_channel_other',
'recruitment_channel_referred', 'recruitment_channel_sourcing',
'department_Analytics', 'department_Finance', 'department_HR',
'department_Legal', 'department_Operations', 'department_Procurement',
'department_R&D', 'department_Sales & Marketing',
'department_Technology'],
dtype='object')
sns.scatterplot(x='no_of_trainings', y='is_promoted', data=df_ohe)
plt.show()
sns.scatterplot(x='avg_training_score', y='is_promoted', data=df_ohe)
plt.show()
# heatmap correlation excluding the one hot encoded columns
sns.heatmap(df_ohe.corr().iloc[1:11,1:11])
<AxesSubplot:>
sns.heatmap(df_ohe.corr())
<AxesSubplot:>
df_ohe.corr().round(3)
| employee_id | region | education | gender | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | ... | recruitment_channel_sourcing | department_Analytics | department_Finance | department_HR | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| employee_id | 1.000 | 0.003 | 0.004 | -0.002 | -0.004 | 0.000 | 0.004 | -0.000 | 0.008 | -0.000 | ... | 0.004 | 0.000 | 0.007 | 0.009 | -0.006 | -0.002 | 0.005 | -0.005 | -0.003 | -0.003 |
| region | 0.003 | 1.000 | -0.076 | 0.104 | -0.002 | -0.239 | -0.022 | -0.141 | 0.008 | 0.023 | ... | 0.000 | 0.105 | -0.019 | -0.041 | -0.019 | 0.030 | -0.070 | 0.014 | 0.009 | -0.030 |
| education | 0.004 | -0.076 | 1.000 | -0.017 | -0.038 | 0.419 | 0.010 | 0.305 | 0.000 | 0.016 | ... | -0.001 | -0.043 | -0.053 | -0.007 | -0.054 | -0.001 | 0.060 | 0.050 | 0.005 | 0.012 |
| gender | -0.002 | 0.104 | -0.017 | 1.000 | 0.090 | -0.002 | -0.022 | -0.012 | 0.003 | -0.031 | ... | 0.004 | 0.146 | 0.017 | -0.054 | 0.049 | -0.123 | -0.134 | 0.073 | 0.152 | -0.075 |
| no_of_trainings | -0.004 | -0.002 | -0.038 | 0.090 | 1.000 | -0.088 | -0.059 | -0.059 | -0.008 | 0.048 | ... | -0.010 | 0.062 | 0.023 | -0.077 | -0.044 | -0.078 | 0.031 | 0.037 | 0.028 | 0.007 |
| age | 0.000 | -0.239 | 0.419 | -0.002 | -0.088 | 1.000 | 0.007 | 0.640 | -0.007 | -0.055 | ... | -0.003 | -0.097 | -0.092 | -0.025 | -0.022 | 0.083 | 0.047 | -0.030 | 0.032 | -0.013 |
| previous_year_rating | 0.004 | -0.022 | 0.010 | -0.022 | -0.059 | 0.007 | 1.000 | 0.001 | 0.027 | 0.070 | ... | -0.005 | 0.055 | 0.028 | 0.024 | 0.006 | 0.120 | -0.012 | 0.024 | -0.129 | -0.053 |
| length_of_service | -0.000 | -0.141 | 0.305 | -0.012 | -0.059 | 0.640 | 0.001 | 1.000 | -0.032 | -0.039 | ... | 0.003 | -0.060 | -0.065 | -0.024 | -0.061 | 0.072 | 0.037 | -0.035 | 0.026 | -0.012 |
| awards_won | 0.008 | 0.008 | 0.000 | 0.003 | -0.008 | -0.007 | 0.027 | -0.032 | 1.000 | 0.067 | ... | -0.007 | 0.002 | 0.007 | -0.006 | 0.001 | 0.000 | 0.002 | -0.001 | -0.009 | 0.007 |
| avg_training_score | -0.000 | 0.023 | 0.016 | -0.031 | 0.048 | -0.055 | 0.070 | -0.039 | 0.067 | 1.000 | ... | -0.008 | 0.481 | -0.040 | -0.233 | -0.029 | -0.092 | 0.216 | 0.203 | -0.671 | 0.472 |
| is_promoted | 0.001 | -0.012 | 0.025 | -0.011 | -0.024 | -0.016 | 0.153 | -0.008 | 0.195 | 0.171 | ... | -0.000 | 0.012 | -0.003 | -0.023 | -0.018 | 0.008 | 0.015 | -0.009 | -0.028 | 0.030 |
| recruitment_channel_other | -0.004 | 0.014 | 0.012 | -0.007 | 0.014 | 0.017 | -0.015 | 0.006 | 0.006 | -0.001 | ... | -0.957 | 0.000 | 0.010 | 0.007 | 0.004 | -0.003 | 0.004 | -0.000 | -0.006 | -0.005 |
| recruitment_channel_referred | 0.000 | -0.049 | -0.036 | 0.009 | -0.014 | -0.046 | 0.067 | -0.033 | 0.003 | 0.028 | ... | -0.128 | -0.011 | -0.030 | 0.032 | -0.008 | -0.001 | -0.028 | -0.002 | -0.023 | 0.073 |
| recruitment_channel_sourcing | 0.004 | 0.000 | -0.001 | 0.004 | -0.010 | -0.003 | -0.005 | 0.003 | -0.007 | -0.008 | ... | 1.000 | 0.003 | -0.001 | -0.017 | -0.002 | 0.003 | 0.004 | 0.001 | 0.013 | -0.016 |
| department_Analytics | 0.000 | 0.105 | -0.043 | 0.146 | 0.062 | -0.097 | 0.055 | -0.060 | 0.002 | 0.481 | ... | 0.003 | 1.000 | -0.073 | -0.071 | -0.046 | -0.169 | -0.128 | -0.045 | -0.209 | -0.128 |
| department_Finance | 0.007 | -0.019 | -0.053 | 0.017 | 0.023 | -0.092 | 0.028 | -0.065 | 0.007 | -0.040 | ... | -0.001 | -0.073 | 1.000 | -0.049 | -0.032 | -0.116 | -0.088 | -0.031 | -0.144 | -0.088 |
| department_HR | 0.009 | -0.041 | -0.007 | -0.054 | -0.077 | -0.025 | 0.024 | -0.024 | -0.006 | -0.233 | ... | -0.017 | -0.071 | -0.049 | 1.000 | -0.031 | -0.113 | -0.086 | -0.030 | -0.140 | -0.086 |
| department_Legal | -0.006 | -0.019 | -0.054 | 0.049 | -0.044 | -0.022 | 0.006 | -0.061 | 0.001 | -0.029 | ... | -0.002 | -0.046 | -0.032 | -0.031 | 1.000 | -0.074 | -0.056 | -0.020 | -0.091 | -0.056 |
| department_Operations | -0.002 | 0.030 | -0.001 | -0.123 | -0.078 | 0.083 | 0.120 | 0.072 | 0.000 | -0.092 | ... | 0.003 | -0.169 | -0.116 | -0.113 | -0.074 | 1.000 | -0.205 | -0.071 | -0.333 | -0.204 |
| department_Procurement | 0.005 | -0.070 | 0.060 | -0.134 | 0.031 | 0.047 | -0.012 | 0.037 | 0.002 | 0.216 | ... | 0.004 | -0.128 | -0.088 | -0.086 | -0.056 | -0.205 | 1.000 | -0.054 | -0.253 | -0.156 |
| department_R&D | -0.005 | 0.014 | 0.050 | 0.073 | 0.037 | -0.030 | 0.024 | -0.035 | -0.001 | 0.203 | ... | 0.001 | -0.045 | -0.031 | -0.030 | -0.020 | -0.071 | -0.054 | 1.000 | -0.088 | -0.054 |
| department_Sales & Marketing | -0.003 | 0.009 | 0.005 | 0.152 | 0.028 | 0.032 | -0.129 | 0.026 | -0.009 | -0.671 | ... | 0.013 | -0.209 | -0.144 | -0.140 | -0.091 | -0.333 | -0.253 | -0.088 | 1.000 | -0.253 |
| department_Technology | -0.003 | -0.030 | 0.012 | -0.075 | 0.007 | -0.013 | -0.053 | -0.012 | 0.007 | 0.472 | ... | -0.016 | -0.128 | -0.088 | -0.086 | -0.056 | -0.204 | -0.156 | -0.054 | -0.253 | 1.000 |
23 rows × 23 columns
# pull only the fields that have a correlation equal to or higher than 0.025, and only look at the is_promoted column
df_ohe.corr()[df_ohe.corr().round(3) >= 0.025].loc['is_promoted']
employee_id                          NaN
region                               NaN
education                       0.025438
gender                               NaN
no_of_trainings                      NaN
age                                  NaN
previous_year_rating            0.153118
length_of_service                    NaN
awards_won                      0.195451
avg_training_score              0.171362
is_promoted                     1.000000
recruitment_channel_other            NaN
recruitment_channel_referred         NaN
recruitment_channel_sourcing         NaN
department_Analytics                 NaN
department_Finance                   NaN
department_HR                        NaN
department_Legal                     NaN
department_Operations                NaN
department_Procurement               NaN
department_R&D                       NaN
department_Sales & Marketing         NaN
department_Technology           0.029687
Name: is_promoted, dtype: float64
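The double-`corr()` masking above works but leaves NaNs where the condition fails. A sketch of a one-pass version that filters on a threshold and returns just the feature names (the toy frame and the feature/target relationship here are hypothetical, built so that only `f1` drives the target):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=['f1', 'f2', 'f3'])
# hypothetical binary target driven almost entirely by f1
toy['is_promoted'] = (toy['f1'] + rng.normal(scale=0.1, size=300) > 0).astype(int)

# correlations with the target at or above a cutoff, absolute value so negatives count too
corr_target = toy.corr()['is_promoted'].drop('is_promoted')
selected = corr_target[corr_target.abs() >= 0.5].index.tolist()
```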
# reference previous versions of the dataframe
version_dict
{0: 'df', 1: 'df_dna', 2: 'df_tcat', 3: 'df_ohe'}
# correlation heatmap of the df version: transform categorical (df_tcat)
sns.heatmap(df_tcat.corr())
<AxesSubplot:>
sns.pairplot(df_ohe, diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x7f84339a1f70>
sns.pairplot(df_tcat, diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x7f83ea07b0a0>
# pull only the fields that have a correlation equal to or higher than 0, and only look at the is_promoted column
df_pc = pd.DataFrame(df_ohe.corr()['is_promoted'][df_ohe.corr()['is_promoted'] >= 0]) # add to new dataframe
df_pc = df_pc.drop(['employee_id','is_promoted'], axis='index') # remove employee id and is_promoted rows
df_pc # df positive correlation
| is_promoted | |
|---|---|
| education | 0.025438 |
| previous_year_rating | 0.153118 |
| awards_won | 0.195451 |
| avg_training_score | 0.171362 |
| recruitment_channel_referred | 0.018459 |
| department_Analytics | 0.011733 |
| department_Operations | 0.008470 |
| department_Procurement | 0.014683 |
| department_Technology | 0.029687 |
# create copy and new version: df feature engineering v04
df_feng = df_ohe.copy()
version_dict[4] = 'df_feng'
df_feng
| employee_id | region | education | gender | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | ... | recruitment_channel_sourcing | department_Analytics | department_Finance | department_HR | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | 7 | 3 | 0 | 0.000000 | 3.555348 | 5.0 | 2.079442 | 0 | 3.891820 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 65141 | 22 | 2 | 1 | 0.000000 | 3.401197 | 5.0 | 1.386294 | 0 | 4.094345 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 7513 | 19 | 2 | 1 | 0.000000 | 3.526361 | 3.0 | 1.945910 | 0 | 3.912023 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 2542 | 23 | 2 | 1 | 0.693147 | 3.663562 | 1.0 | 2.302585 | 0 | 3.912023 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 48945 | 26 | 2 | 1 | 0.000000 | 3.806662 | 3.0 | 0.693147 | 0 | 4.290459 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54802 | 6915 | 14 | 2 | 1 | 0.693147 | 3.433987 | 1.0 | 0.693147 | 0 | 3.891820 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 54803 | 3030 | 14 | 2 | 1 | 0.000000 | 3.871201 | 3.0 | 2.833213 | 0 | 4.356709 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 54804 | 74592 | 27 | 3 | 0 | 0.000000 | 3.610918 | 2.0 | 1.791759 | 0 | 4.025352 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 54805 | 13918 | 1 | 2 | 1 | 0.000000 | 3.295837 | 5.0 | 1.098612 | 0 | 4.369448 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 54807 | 51526 | 22 | 2 | 1 | 0.000000 | 3.295837 | 1.0 | 1.609438 | 0 | 3.891820 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
52399 rows × 23 columns
# ratio features; note log(length_of_service) is 0 when service == 1 year,
# so the two divisions by length_of_service below can produce inf
df_feng['r_edu_age'] = df_feng['education'].values / df_feng['age'].values
df_feng['r_score_serv'] = df_feng['avg_training_score'].values / df_feng['length_of_service'].values
df_feng['r_serv_age'] = df_feng['age'].values / df_feng['length_of_service'].values
df_feng['r_prevyear_score'] = df_feng['previous_year_rating'].values / df_feng['avg_training_score'].values
df_feng
<ipython-input-227-90caf8c8b4bf>:2: RuntimeWarning: divide by zero encountered in true_divide
  df_feng['r_score_serv'] = df_feng['avg_training_score'].values / df_feng['length_of_service'].values
<ipython-input-227-90caf8c8b4bf>:3: RuntimeWarning: divide by zero encountered in true_divide
  df_feng['r_serv_age'] = df_feng['age'].values / df_feng['length_of_service'].values
| employee_id | region | education | gender | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | ... | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | r_edu_age | r_score_serv | r_serv_age | r_prevyear_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | 7 | 3 | 0 | 0.000000 | 3.555348 | 5.0 | 2.079442 | 0 | 3.891820 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0.843799 | 1.871570 | 1.709761 | 1.284746 |
| 1 | 65141 | 22 | 2 | 1 | 0.000000 | 3.401197 | 5.0 | 1.386294 | 0 | 4.094345 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0.588028 | 2.953445 | 2.453445 | 1.221197 |
| 2 | 7513 | 19 | 2 | 1 | 0.000000 | 3.526361 | 3.0 | 1.945910 | 0 | 3.912023 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0.567157 | 2.010382 | 1.812191 | 0.766867 |
| 3 | 2542 | 23 | 2 | 1 | 0.693147 | 3.663562 | 1.0 | 2.302585 | 0 | 3.912023 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0.545917 | 1.698970 | 1.591065 | 0.255622 |
| 4 | 48945 | 26 | 2 | 1 | 0.000000 | 3.806662 | 3.0 | 0.693147 | 0 | 4.290459 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0.525395 | 6.189825 | 5.491853 | 0.699226 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54802 | 6915 | 14 | 2 | 1 | 0.693147 | 3.433987 | 1.0 | 0.693147 | 0 | 3.891820 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0.582413 | 5.614710 | 4.954196 | 0.256949 |
| 54803 | 3030 | 14 | 2 | 1 | 0.000000 | 3.871201 | 3.0 | 2.833213 | 0 | 4.356709 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0.516636 | 1.537727 | 1.366364 | 0.688593 |
| 54804 | 74592 | 27 | 3 | 0 | 0.000000 | 3.610918 | 2.0 | 1.791759 | 0 | 4.025352 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0.830814 | 2.246592 | 2.015292 | 0.496851 |
| 54805 | 13918 | 1 | 2 | 1 | 0.000000 | 3.295837 | 5.0 | 1.098612 | 0 | 4.369448 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0.606826 | 3.977243 | 3.000000 | 1.144309 |
| 54807 | 51526 | 22 | 2 | 1 | 0.000000 | 3.295837 | 1.0 | 1.609438 | 0 | 3.891820 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0.606826 | 2.418124 | 2.047819 | 0.256949 |
52399 rows × 27 columns
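The divide-by-zero warning above means the service-based ratios contain `inf` for employees whose log-transformed `length_of_service` is 0. One way to neutralize these values before modeling is to convert `inf` to `NaN` and impute; a minimal sketch on a toy frame (not the project data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_feng: the second employee has length_of_service == 0
df = pd.DataFrame({'avg_training_score': [4.0, 3.9],
                   'length_of_service': [2.0, 0.0]})

# Ratio feature; dividing by zero yields inf
df['r_score_serv'] = df['avg_training_score'] / df['length_of_service']

# Replace +/-inf with NaN, then impute with the column median so models can fit
df['r_score_serv'] = df['r_score_serv'].replace([np.inf, -np.inf], np.nan)
df['r_score_serv'] = df['r_score_serv'].fillna(df['r_score_serv'].median())
```

In the notebook this would be applied to `df_feng` before the correlation step; the alternative taken below is simply to drop the affected columns.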
df_feng.corr()
| employee_id | region | education | gender | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | ... | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | r_edu_age | r_score_serv | r_serv_age | r_prevyear_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| employee_id | 1.000000 | 0.003448 | 0.003816 | -0.001940 | -0.003732 | 0.000031 | 0.004309 | -0.000043 | 0.007586 | -0.000008 | ... | -0.006457 | -0.002145 | 0.004760 | -0.005463 | -0.003372 | -0.003194 | 0.004199 | -0.004865 | -0.004904 | 0.004269 |
| region | 0.003448 | 1.000000 | -0.076109 | 0.104107 | -0.002367 | -0.238613 | -0.022493 | -0.140516 | 0.007803 | 0.023410 | ... | -0.018935 | 0.030050 | -0.070046 | 0.013517 | 0.009061 | -0.029905 | -0.010340 | 0.104200 | 0.080951 | -0.026181 |
| education | 0.003816 | -0.076109 | 1.000000 | -0.017288 | -0.037741 | 0.418837 | 0.009864 | 0.305105 | 0.000495 | 0.015572 | ... | -0.054021 | -0.000788 | 0.059840 | 0.049640 | 0.004625 | 0.011519 | 0.964094 | -0.221883 | -0.185208 | 0.007769 |
| gender | -0.001940 | 0.104107 | -0.017288 | 1.000000 | 0.089994 | -0.001691 | -0.022008 | -0.011807 | 0.002721 | -0.031485 | ... | 0.049347 | -0.122858 | -0.133642 | 0.073239 | 0.152343 | -0.074502 | -0.017530 | 0.007864 | 0.011339 | -0.018438 |
| no_of_trainings | -0.003732 | -0.002367 | -0.037741 | 0.089994 | 1.000000 | -0.087983 | -0.059099 | -0.059028 | -0.007656 | 0.048231 | ... | -0.043668 | -0.077603 | 0.031413 | 0.037365 | 0.027937 | 0.006601 | -0.015791 | 0.054771 | 0.041618 | -0.069209 |
| age | 0.000031 | -0.238613 | 0.418837 | -0.001691 | -0.087983 | 1.000000 | 0.006646 | 0.640160 | -0.007233 | -0.054701 | ... | -0.022477 | 0.083489 | 0.047488 | -0.029758 | 0.031521 | -0.012510 | 0.168239 | -0.484376 | -0.386218 | 0.013585 |
| previous_year_rating | 0.004309 | -0.022493 | 0.009864 | -0.022008 | -0.059099 | 0.006646 | 1.000000 | 0.000637 | 0.026890 | 0.069899 | ... | 0.006490 | 0.120322 | -0.011593 | 0.024000 | -0.128653 | -0.052704 | 0.008026 | 0.006005 | -0.000967 | 0.990047 |
| length_of_service | -0.000043 | -0.140516 | 0.305105 | -0.011807 | -0.059028 | 0.640160 | 0.000637 | 1.000000 | -0.032284 | -0.039330 | ... | -0.061011 | 0.071555 | 0.036736 | -0.035148 | 0.025760 | -0.011799 | 0.149802 | -0.902118 | -0.897274 | 0.005349 |
| awards_won | 0.007586 | 0.007803 | 0.000495 | 0.002721 | -0.007656 | -0.007233 | 0.026890 | -0.032284 | 1.000000 | 0.066593 | ... | 0.000931 | 0.000410 | 0.002337 | -0.001408 | -0.008562 | 0.007038 | 0.002671 | 0.035513 | 0.030870 | 0.016408 |
| avg_training_score | -0.000008 | 0.023410 | 0.015572 | -0.031485 | 0.048231 | -0.054701 | 0.069899 | -0.039330 | 0.066593 | 1.000000 | ... | -0.029085 | -0.092418 | 0.216256 | 0.203394 | -0.670833 | 0.471735 | 0.031238 | 0.143253 | 0.031816 | -0.062430 |
| is_promoted | 0.000751 | -0.011738 | 0.025438 | -0.010575 | -0.024425 | -0.015876 | 0.153118 | -0.007650 | 0.195451 | 0.171362 | ... | -0.017928 | 0.008470 | 0.014683 | -0.008668 | -0.028105 | 0.029687 | 0.031281 | 0.026850 | 0.007472 | 0.126655 |
| recruitment_channel_other | -0.004135 | 0.014206 | 0.011503 | -0.006517 | 0.013877 | 0.016845 | -0.015076 | 0.006486 | 0.005795 | -0.000559 | ... | 0.004409 | -0.002758 | 0.004286 | -0.000150 | -0.006179 | -0.005281 | 0.008141 | -0.006113 | -0.004125 | -0.014924 |
| recruitment_channel_referred | 0.000398 | -0.048851 | -0.036450 | 0.008714 | -0.013583 | -0.046139 | 0.066943 | -0.033423 | 0.003249 | 0.028287 | ... | -0.007917 | -0.000865 | -0.028384 | -0.001959 | -0.022907 | 0.072946 | -0.027357 | 0.030724 | 0.022632 | 0.063163 |
| recruitment_channel_sourcing | 0.004041 | 0.000097 | -0.000836 | 0.003989 | -0.009958 | -0.003356 | -0.004549 | 0.003318 | -0.006785 | -0.007767 | ... | -0.002104 | 0.003029 | 0.004047 | 0.000727 | 0.012960 | -0.016169 | -0.000132 | -0.002732 | -0.002393 | -0.003589 |
| department_Analytics | 0.000370 | 0.105134 | -0.042752 | 0.145812 | 0.062130 | -0.096844 | 0.054960 | -0.059592 | 0.001643 | 0.481072 | ... | -0.046181 | -0.168872 | -0.128440 | -0.044702 | -0.208585 | -0.128156 | -0.016386 | 0.103602 | 0.041427 | -0.010201 |
| department_Finance | 0.007306 | -0.018551 | -0.052677 | 0.017249 | 0.023293 | -0.091539 | 0.027774 | -0.064771 | 0.006594 | -0.040056 | ... | -0.031773 | -0.116188 | -0.088370 | -0.030756 | -0.143511 | -0.088175 | -0.031811 | 0.045420 | 0.041657 | 0.032570 |
| department_HR | 0.008834 | -0.040508 | -0.007284 | -0.053947 | -0.077457 | -0.025289 | 0.023790 | -0.023601 | -0.006255 | -0.232670 | ... | -0.031005 | -0.113379 | -0.086233 | -0.030013 | -0.140041 | -0.086042 | -0.004049 | -0.010752 | 0.011946 | 0.057347 |
| department_Legal | -0.006457 | -0.018935 | -0.054021 | 0.049347 | -0.043668 | -0.022477 | 0.006490 | -0.061011 | 0.000931 | -0.029085 | ... | 1.000000 | -0.073685 | -0.056043 | -0.019505 | -0.091013 | -0.055919 | -0.054127 | 0.039538 | 0.043993 | 0.010117 |
| department_Operations | -0.002145 | 0.030050 | -0.000788 | -0.122858 | -0.077603 | 0.083489 | 0.120322 | 0.071555 | 0.000410 | -0.092418 | ... | -0.073685 | 1.000000 | -0.204935 | -0.071326 | -0.332813 | -0.204483 | -0.025612 | -0.067687 | -0.050324 | 0.131483 |
| department_Procurement | 0.004760 | -0.070046 | 0.059840 | -0.133642 | 0.031413 | 0.047488 | -0.011593 | 0.036736 | 0.002337 | 0.216256 | ... | -0.056043 | -0.204935 | 1.000000 | -0.054249 | -0.253129 | -0.155524 | 0.048991 | -0.006250 | -0.023860 | -0.040850 |
| department_R&D | -0.005463 | 0.013517 | 0.049640 | 0.073239 | 0.037365 | -0.029758 | 0.024000 | -0.035148 | -0.001408 | 0.203394 | ... | -0.019505 | -0.071326 | -0.054249 | 1.000000 | -0.088099 | -0.054129 | 0.065224 | 0.053793 | 0.029798 | -0.003595 |
| department_Sales & Marketing | -0.003372 | 0.009061 | 0.004625 | 0.152343 | 0.027937 | 0.031521 | -0.128653 | 0.025760 | -0.008562 | -0.670833 | ... | -0.091013 | -0.332813 | -0.253129 | -0.088099 | 1.000000 | -0.252570 | 0.000604 | -0.092217 | -0.018796 | -0.041785 |
| department_Technology | -0.003194 | -0.029905 | 0.011519 | -0.074502 | 0.006601 | -0.012510 | -0.052704 | -0.011799 | 0.007038 | 0.471735 | ... | -0.055919 | -0.204483 | -0.155524 | -0.054129 | -0.252570 | 1.000000 | 0.013623 | 0.064872 | 0.013199 | -0.110392 |
| r_edu_age | 0.004199 | -0.010340 | 0.964094 | -0.017530 | -0.015791 | 0.168239 | 0.008026 | 0.149802 | 0.002671 | 0.031238 | ... | -0.054127 | -0.025612 | 0.048991 | 0.065224 | 0.000604 | 0.013623 | 1.000000 | -0.106421 | -0.093887 | 0.003926 |
| r_score_serv | -0.004865 | 0.104200 | -0.221883 | 0.007864 | 0.054771 | -0.484376 | 0.006005 | -0.902118 | 0.035513 | 0.143253 | ... | 0.039538 | -0.067687 | -0.006250 | 0.053793 | -0.092217 | 0.064872 | -0.106421 | 1.000000 | 0.984930 | -0.012307 |
| r_serv_age | -0.004904 | 0.080951 | -0.185208 | 0.011339 | 0.041618 | -0.386218 | -0.000967 | -0.897274 | 0.030870 | 0.031816 | ... | 0.043993 | -0.050324 | -0.023860 | 0.029798 | -0.018796 | 0.013199 | -0.093887 | 0.984930 | 1.000000 | -0.005094 |
| r_prevyear_score | 0.004269 | -0.026181 | 0.007769 | -0.018438 | -0.069209 | 0.013585 | 0.990047 | 0.005349 | 0.016408 | -0.062430 | ... | 0.010117 | 0.131483 | -0.040850 | -0.003595 | -0.041785 | -0.110392 | 0.003926 | -0.012307 | -0.005094 | 1.000000 |
27 rows × 27 columns
# keep only the features whose correlation with is_promoted is non-negative
df_fepc = pd.DataFrame(df_feng.corr()['is_promoted'][df_feng.corr()['is_promoted'] >= 0]) # add to new dataframe
df_fepc = df_fepc.drop(['employee_id','is_promoted'], axis='index') # drop the id column and the target itself
df_fepc # df of feature-engineered positive correlations
| is_promoted | |
|---|---|
| education | 0.025438 |
| previous_year_rating | 0.153118 |
| awards_won | 0.195451 |
| avg_training_score | 0.171362 |
| recruitment_channel_referred | 0.018459 |
| department_Analytics | 0.011733 |
| department_Operations | 0.008470 |
| department_Procurement | 0.014683 |
| department_Technology | 0.029687 |
| r_edu_age | 0.031281 |
| r_score_serv | 0.026850 |
| r_serv_age | 0.007472 |
| r_prevyear_score | 0.126655 |
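The selection above keeps any feature with a non-negative correlation, however close to zero, and discards strongly negative predictors that could still be informative. An alternative is to filter on absolute correlation against a cutoff; a sketch on synthetic data (`demo` and the 0.1 cutoff are illustrative, not from the project dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(88)
# Synthetic frame: 'strong' drives the target, 'weak' is pure noise
demo = pd.DataFrame({'strong': rng.normal(size=2000)})
demo['target'] = (demo['strong'] + rng.normal(scale=0.5, size=2000) > 0).astype(int)
demo['weak'] = rng.normal(size=2000)

# Keep features whose absolute correlation with the target clears a cutoff
corr = demo.corr()['target'].drop('target')
selected = corr[corr.abs() >= 0.1].index.tolist()
```

This keeps `'strong'` and discards `'weak'` regardless of the sign of the correlation.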
df_feng.describe()
| employee_id | region | education | gender | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | ... | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | r_edu_age | r_score_serv | r_serv_age | r_prevyear_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | ... | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 5.239900e+04 | 5.239900e+04 | 52399.000000 |
| mean | 39184.187141 | 14.239184 | 2.269471 | 0.696158 | 0.156515 | 3.532928 | 3.337526 | 1.533522 | 0.023168 | 4.137473 | ... | 0.019752 | 0.212256 | 0.134850 | 0.018531 | 0.291322 | 0.134335 | 0.641174 | inf | inf | 0.807561 |
| std | 22598.386766 | 10.057990 | 0.477060 | 0.459920 | 0.337045 | 0.205642 | 1.212211 | 0.731684 | 0.150439 | 0.202777 | ... | 0.139149 | 0.408909 | 0.341566 | 0.134862 | 0.454376 | 0.341015 | 0.122845 | NaN | NaN | 0.295311 |
| min | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 2.995732 | 1.000000 | 0.000000 | 0.000000 | 3.663562 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.291207 | 1.072082e+00 | 1.133879e+00 | 0.218104 |
| 25% | 19651.500000 | 4.000000 | 2.000000 | 0.000000 | 0.000000 | 3.367296 | 3.000000 | 1.098612 | 0.000000 | 3.951244 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.567157 | 2.020559e+00 | 1.761801e+00 | 0.684615 |
| 50% | 39207.000000 | 13.000000 | 2.000000 | 1.000000 | 0.000000 | 3.496508 | 3.000000 | 1.609438 | 0.000000 | 4.127134 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.593948 | 2.583620e+00 | 2.172502e+00 | 0.766867 |
| 75% | 58738.500000 | 22.000000 | 3.000000 | 1.000000 | 0.000000 | 3.663562 | 4.000000 | 2.079442 | 0.000000 | 4.330733 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.770848 | 3.799691e+00 | 3.182658e+00 | 1.017339 |
| max | 78298.000000 | 34.000000 | 3.000000 | 1.000000 | 2.302585 | 4.094345 | 5.000000 | 3.610918 | 1.000000 | 4.595120 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.001425 | inf | inf | 1.355425 |
8 rows × 27 columns
# drop the two ratio features that picked up inf values from the zero length_of_service rows
df_feng.drop(['r_score_serv', 'r_serv_age'], axis=1, inplace=True)
df_feng.describe()
| employee_id | region | education | gender | no_of_trainings | age | previous_year_rating | length_of_service | awards_won | avg_training_score | ... | department_Finance | department_HR | department_Legal | department_Operations | department_Procurement | department_R&D | department_Sales & Marketing | department_Technology | r_edu_age | r_prevyear_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | ... | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 | 52399.000000 |
| mean | 39184.187141 | 14.239184 | 2.269471 | 0.696158 | 0.156515 | 3.532928 | 3.337526 | 1.533522 | 0.023168 | 4.137473 | ... | 0.047711 | 0.045535 | 0.019752 | 0.212256 | 0.134850 | 0.018531 | 0.291322 | 0.134335 | 0.641174 | 0.807561 |
| std | 22598.386766 | 10.057990 | 0.477060 | 0.459920 | 0.337045 | 0.205642 | 1.212211 | 0.731684 | 0.150439 | 0.202777 | ... | 0.213156 | 0.208477 | 0.139149 | 0.408909 | 0.341566 | 0.134862 | 0.454376 | 0.341015 | 0.122845 | 0.295311 |
| min | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 2.995732 | 1.000000 | 0.000000 | 0.000000 | 3.663562 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.291207 | 0.218104 |
| 25% | 19651.500000 | 4.000000 | 2.000000 | 0.000000 | 0.000000 | 3.367296 | 3.000000 | 1.098612 | 0.000000 | 3.951244 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.567157 | 0.684615 |
| 50% | 39207.000000 | 13.000000 | 2.000000 | 1.000000 | 0.000000 | 3.496508 | 3.000000 | 1.609438 | 0.000000 | 4.127134 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.593948 | 0.766867 |
| 75% | 58738.500000 | 22.000000 | 3.000000 | 1.000000 | 0.000000 | 3.663562 | 4.000000 | 2.079442 | 0.000000 | 4.330733 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.770848 | 1.017339 |
| max | 78298.000000 | 34.000000 | 3.000000 | 1.000000 | 2.302585 | 4.094345 | 5.000000 | 3.610918 | 1.000000 | 4.595120 | ... | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.001425 | 1.355425 |
8 rows × 25 columns
# keep only the features whose correlation with is_promoted is non-positive
df_nc = pd.DataFrame(df_ohe.corr()['is_promoted'][df_ohe.corr()['is_promoted'] <= 0]) # add to new dataframe
df_nc # df negative correlation
| is_promoted | |
|---|---|
| region | -0.011738 |
| gender | -0.010575 |
| no_of_trainings | -0.024425 |
| age | -0.015876 |
| length_of_service | -0.007650 |
| recruitment_channel_other | -0.005355 |
| recruitment_channel_sourcing | -0.000050 |
| department_Finance | -0.003465 |
| department_HR | -0.023417 |
| department_Legal | -0.017928 |
| department_R&D | -0.008668 |
| department_Sales & Marketing | -0.028105 |
df_feng.drop(['region', 'gender', 'no_of_trainings', 'age', 'length_of_service',
'recruitment_channel_other', 'recruitment_channel_sourcing',
'department_Finance', 'department_HR', 'department_Legal',
'department_R&D', 'department_Sales & Marketing'], axis=1, inplace=True)
df_feng
| employee_id | education | previous_year_rating | awards_won | avg_training_score | is_promoted | recruitment_channel_referred | department_Analytics | department_Operations | department_Procurement | department_Technology | r_edu_age | r_prevyear_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | 3 | 5.0 | 0 | 3.891820 | 0 | 0 | 0 | 0 | 0 | 0 | 0.843799 | 1.284746 |
| 1 | 65141 | 2 | 5.0 | 0 | 4.094345 | 0 | 0 | 0 | 1 | 0 | 0 | 0.588028 | 1.221197 |
| 2 | 7513 | 2 | 3.0 | 0 | 3.912023 | 0 | 0 | 0 | 0 | 0 | 0 | 0.567157 | 0.766867 |
| 3 | 2542 | 2 | 1.0 | 0 | 3.912023 | 0 | 0 | 0 | 0 | 0 | 0 | 0.545917 | 0.255622 |
| 4 | 48945 | 2 | 3.0 | 0 | 4.290459 | 0 | 0 | 0 | 0 | 0 | 1 | 0.525395 | 0.699226 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 54802 | 6915 | 2 | 1.0 | 0 | 3.891820 | 0 | 0 | 0 | 0 | 0 | 0 | 0.582413 | 0.256949 |
| 54803 | 3030 | 2 | 3.0 | 0 | 4.356709 | 0 | 0 | 0 | 0 | 0 | 1 | 0.516636 | 0.688593 |
| 54804 | 74592 | 3 | 2.0 | 0 | 4.025352 | 0 | 0 | 0 | 1 | 0 | 0 | 0.830814 | 0.496851 |
| 54805 | 13918 | 2 | 5.0 | 0 | 4.369448 | 0 | 0 | 1 | 0 | 0 | 0 | 0.606826 | 1.144309 |
| 54807 | 51526 | 2 | 1.0 | 0 | 3.891820 | 0 | 0 | 0 | 0 | 0 | 0 | 0.606826 | 0.256949 |
52399 rows × 13 columns
X = df_feng.drop(['is_promoted'], axis = 1)
y = df_feng['is_promoted']
X_train, X_validation, y_train, y_validation = train_test_split(X, y , test_size=0.3, random_state=88)
k = KFold(random_state=88, n_splits=5, shuffle=True)
algo= []
cv_r2_mean = []
cv_r2_std = []
cv_rmse_mean = []
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('reg', LinearRegression())
])
pipeline.fit(X,y)
algo.append('Linear Regression')
scores = cross_val_score(pipeline, X, y, cv=k)
cv_r2_mean.append(scores.mean())
cv_r2_std.append(scores.std())
cv_rmse_mean.append((-cross_val_score(pipeline, X, y, cv=k,scoring='neg_mean_squared_error').mean())**0.5)
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=88)
algo.append('Decision Tree')
scores = cross_val_score(model, X, y, cv=k)
cv_r2_mean.append(scores.mean())
cv_r2_std.append(scores.std())
cv_rmse_mean.append((-cross_val_score(model, X, y, cv=k,scoring='neg_mean_squared_error').mean())**0.5)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=88)
algo.append('Random Forest')
scores = cross_val_score(model, X, y, cv=k)
cv_r2_mean.append(scores.mean())
cv_r2_std.append(scores.std())
cv_rmse_mean.append((-cross_val_score(model, X, y, cv=k, scoring='neg_mean_squared_error').mean())**0.5)
from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(random_state=88)
algo.append('AdaBoost')
scores = cross_val_score(model, X, y, cv=k)
cv_r2_mean.append(scores.mean())
cv_r2_std.append(scores.std())
cv_rmse_mean.append((-cross_val_score(model, X, y, cv=k,scoring='neg_mean_squared_error').mean())**0.5)
from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor(random_state=88)
algo.append('Gradient Boosting')
scores = cross_val_score(model, X, y, cv=k)
cv_r2_mean.append(scores.mean())
cv_r2_std.append(scores.std())
cv_rmse_mean.append((-cross_val_score(model, X, y, cv=k,scoring='neg_mean_squared_error').mean())**0.5)
results = pd.DataFrame()
results['Model'] = algo
results['CV R2 score mean'] = cv_r2_mean
results['CV R2 score std'] = cv_r2_std
results['CV RMSE'] = cv_rmse_mean
results = results.set_index('Model')
results
| CV R2 score mean | CV R2 score std | CV RMSE | |
|---|---|---|---|
| Model | |||
| Linear Regression | 0.114667 | 0.007018 | 0.264857 |
| Decision Tree | -0.529219 | 0.042597 | 0.348007 |
| Random Forest | 0.174600 | 0.010003 | 0.255729 |
| AdaBoost | -0.007928 | 0.179463 | 0.282366 |
| Gradient Boosting | 0.242597 | 0.012367 | 0.244979 |
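The five model blocks above repeat the same cross-validation boilerplate. They can be collapsed into a single helper; a sketch on synthetic data (`cv_summary` and the `*_demo` names are hypothetical; in the notebook the real `X`, `y`, `k` would be passed instead):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

def cv_summary(models, X, y, cv):
    """Cross-validate each named model and collect R2 / RMSE in one table."""
    rows = []
    for name, model in models.items():
        r2 = cross_val_score(model, X, y, cv=cv)
        mse = -cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
        rows.append({'Model': name,
                     'CV R2 score mean': r2.mean(),
                     'CV R2 score std': r2.std(),
                     'CV RMSE': mse.mean() ** 0.5})
    return pd.DataFrame(rows).set_index('Model')

# Synthetic stand-in for the notebook's X, y and KFold object
rng = np.random.default_rng(88)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
y_demo = X_demo['a'] * 2 + rng.normal(size=200)
k_demo = KFold(n_splits=5, shuffle=True, random_state=88)

summary = cv_summary({'Linear Regression': LinearRegression(),
                      'Decision Tree': DecisionTreeRegressor(random_state=88)},
                     X_demo, y_demo, k_demo)
```

Adding or removing a candidate model then means editing one dict entry rather than copying a four-line block.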
# run after removing the negatively correlated features
# results02 = results.copy()
results02
| CV R2 score mean | CV R2 score std | CV RMSE | |
|---|---|---|---|
| Model | |||
| Linear Regression | 0.114667 | 0.007018 | 0.264857 |
| Decision Tree | -0.529219 | 0.042597 | 0.348007 |
| Random Forest | 0.174600 | 0.010003 | 0.255729 |
| AdaBoost | -0.007928 | 0.179463 | 0.282366 |
| Gradient Boosting | 0.242597 | 0.012367 | 0.244979 |
# initial run, with all features still present
# results01 = results.copy()
results01
| CV R2 score mean | CV R2 score std | CV RMSE | |
|---|---|---|---|
| Model | |||
| Linear Regression | 0.174710 | 0.005590 | 0.255717 |
| Decision Tree | -0.499530 | 0.022689 | 0.344646 |
| Random Forest | 0.267541 | 0.005571 | 0.240901 |
| AdaBoost | 0.065075 | 0.152098 | 0.271905 |
| Gradient Boosting | 0.284001 | 0.008403 | 0.238192 |
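Since `is_promoted` is a binary label and the brief asks for a classification model, the R2/RMSE numbers above can be complemented by a classification pipeline scored with F1 under stratified folds, which respects the class imbalance. A minimal sketch on synthetic data (the `*_demo` names are illustrative; in the notebook the real `X`, `y` would be used):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary target standing in for is_promoted
rng = np.random.default_rng(88)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)),
                      columns=['education', 'previous_year_rating',
                               'awards_won', 'avg_training_score'])
y_demo = (X_demo['avg_training_score'] + rng.normal(scale=0.5, size=300) > 1).astype(int)

# Scale then classify; score with F1 so the minority (promoted) class matters
clf = Pipeline([('scaler', StandardScaler()),
                ('clf', LogisticRegression(max_iter=1000))])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=88)
f1_scores = cross_val_score(clf, X_demo, y_demo, cv=skf, scoring='f1')
```

`StratifiedKFold` keeps the promoted/not-promoted ratio roughly constant across folds, which plain `KFold` does not guarantee.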
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestRegressor(random_state=88, n_jobs=-1)
params = {
'max_depth': [1,3,5,10,20,40,50,60],
'min_samples_split': [10,50,100,500],
'n_estimators': [10,25,100,200,500]
}
grid = RandomizedSearchCV(rf, params, cv = k, n_jobs=-1, n_iter=30)
grid.fit(X,y)
RandomizedSearchCV(cv=KFold(n_splits=5, random_state=88, shuffle=True),
estimator=RandomForestRegressor(n_jobs=-1, random_state=88),
n_iter=30, n_jobs=-1,
param_distributions={'max_depth': [1, 3, 5, 10, 20, 40, 50,
60],
'min_samples_split': [10, 50, 100, 500],
'n_estimators': [10, 25, 100, 200,
500]})
grid.best_params_
{'n_estimators': 200, 'min_samples_split': 10, 'max_depth': 10}
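Because `refit=True` by default, `RandomizedSearchCV` retrains the winning configuration on all of the data it was fit on, so the tuned model is available directly as `best_estimator_` rather than being rebuilt by hand. A self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for the notebook's X, y
rng = np.random.default_rng(88)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo[:, 0] * 2 + rng.normal(size=200)

search = RandomizedSearchCV(RandomForestRegressor(random_state=88),
                            {'n_estimators': [10, 25], 'max_depth': [3, 5]},
                            n_iter=4,
                            cv=KFold(n_splits=3, shuffle=True, random_state=88),
                            random_state=88)
search.fit(X_demo, y_demo)

# refit=True (the default) has already retrained the best configuration on
# all of X_demo, so the tuned model can be used immediately:
best_rf = search.best_estimator_
preds = best_rf.predict(X_demo)
```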
grid_results = pd.DataFrame(grid.cv_results_)
grid_results
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_estimators | param_min_samples_split | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.939270 | 0.117264 | 0.049058 | 0.011314 | 100 | 10 | 50 | {'n_estimators': 100, 'min_samples_split': 10,... | 0.206666 | 0.227671 | 0.229960 | 0.207001 | 0.230785 | 0.220417 | 0.011138 | 11 |
| 1 | 0.954085 | 0.061131 | 0.013471 | 0.000135 | 25 | 100 | 3 | {'n_estimators': 25, 'min_samples_split': 100,... | 0.132872 | 0.148209 | 0.149613 | 0.123968 | 0.146979 | 0.140328 | 0.010157 | 26 |
| 2 | 2.086960 | 0.086527 | 0.011339 | 0.000211 | 10 | 50 | 40 | {'n_estimators': 10, 'min_samples_split': 50, ... | 0.221087 | 0.235295 | 0.237677 | 0.213034 | 0.242897 | 0.229998 | 0.011141 | 9 |
| 3 | 1.668671 | 0.101171 | 0.033927 | 0.002568 | 100 | 50 | 1 | {'n_estimators': 100, 'min_samples_split': 50,... | 0.081351 | 0.100641 | 0.089318 | 0.086125 | 0.096505 | 0.090788 | 0.006967 | 28 |
| 4 | 20.840210 | 0.098424 | 0.176224 | 0.042031 | 500 | 10 | 3 | {'n_estimators': 500, 'min_samples_split': 10,... | 0.134133 | 0.148188 | 0.150021 | 0.123739 | 0.148873 | 0.140991 | 0.010394 | 24 |
| 5 | 5.614264 | 0.258157 | 0.017990 | 0.000417 | 25 | 10 | 50 | {'n_estimators': 25, 'min_samples_split': 10, ... | 0.194557 | 0.214273 | 0.212465 | 0.197246 | 0.221394 | 0.207987 | 0.010344 | 16 |
| 6 | 0.588905 | 0.031772 | 0.011295 | 0.000225 | 10 | 500 | 3 | {'n_estimators': 10, 'min_samples_split': 500,... | 0.127310 | 0.141193 | 0.144098 | 0.114836 | 0.136729 | 0.132833 | 0.010643 | 27 |
| 7 | 1.493262 | 0.136291 | 0.011476 | 0.000206 | 10 | 50 | 10 | {'n_estimators': 10, 'min_samples_split': 50, ... | 0.237565 | 0.243925 | 0.245422 | 0.221708 | 0.255508 | 0.240825 | 0.011157 | 7 |
| 8 | 24.715356 | 0.233873 | 0.071653 | 0.013266 | 200 | 10 | 10 | {'n_estimators': 200, 'min_samples_split': 10,... | 0.241452 | 0.252884 | 0.262153 | 0.231601 | 0.255017 | 0.248621 | 0.010800 | 1 |
| 9 | 29.601928 | 0.279766 | 0.064092 | 0.004789 | 200 | 500 | 20 | {'n_estimators': 200, 'min_samples_split': 500... | 0.204383 | 0.221838 | 0.225512 | 0.193555 | 0.217222 | 0.212502 | 0.011862 | 13 |
| 10 | 2.655569 | 0.182463 | 0.013370 | 0.000127 | 10 | 10 | 60 | {'n_estimators': 10, 'min_samples_split': 10, ... | 0.174150 | 0.192302 | 0.186541 | 0.171530 | 0.198901 | 0.184685 | 0.010465 | 18 |
| 11 | 58.395502 | 0.149775 | 0.214357 | 0.067472 | 500 | 50 | 10 | {'n_estimators': 500, 'min_samples_split': 50,... | 0.238503 | 0.252152 | 0.259882 | 0.226424 | 0.253920 | 0.246176 | 0.012104 | 2 |
| 12 | 2.000267 | 0.060082 | 0.015830 | 0.000985 | 25 | 100 | 5 | {'n_estimators': 25, 'min_samples_split': 100,... | 0.169243 | 0.179415 | 0.184192 | 0.159569 | 0.184351 | 0.175354 | 0.009613 | 21 |
| 13 | 12.631443 | 0.590534 | 0.038132 | 0.005575 | 100 | 50 | 10 | {'n_estimators': 100, 'min_samples_split': 50,... | 0.238923 | 0.251090 | 0.260135 | 0.226219 | 0.253201 | 0.245914 | 0.011991 | 3 |
| 14 | 2.897875 | 0.183829 | 0.013214 | 0.000245 | 10 | 10 | 40 | {'n_estimators': 10, 'min_samples_split': 10, ... | 0.174279 | 0.193035 | 0.186469 | 0.171530 | 0.199105 | 0.184884 | 0.010602 | 17 |
| 15 | 2.378587 | 0.147371 | 0.012310 | 0.000266 | 10 | 10 | 20 | {'n_estimators': 10, 'min_samples_split': 10, ... | 0.204683 | 0.218882 | 0.215051 | 0.197222 | 0.221471 | 0.211462 | 0.009131 | 14 |
| 16 | 83.183156 | 0.804597 | 0.450697 | 0.051243 | 500 | 100 | 40 | {'n_estimators': 500, 'min_samples_split': 100... | 0.236440 | 0.253744 | 0.256688 | 0.224471 | 0.255964 | 0.245462 | 0.012860 | 5 |
| 17 | 3.977380 | 0.071712 | 0.033419 | 0.001096 | 100 | 50 | 3 | {'n_estimators': 100, 'min_samples_split': 50,... | 0.133922 | 0.149079 | 0.149785 | 0.124027 | 0.148655 | 0.141094 | 0.010384 | 23 |
| 18 | 0.824375 | 0.061372 | 0.011073 | 0.000109 | 10 | 10 | 5 | {'n_estimators': 10, 'min_samples_split': 10, ... | 0.169309 | 0.181635 | 0.188459 | 0.167077 | 0.182601 | 0.177816 | 0.008227 | 19 |
| 19 | 18.360088 | 0.358191 | 0.039622 | 0.002230 | 100 | 10 | 60 | {'n_estimators': 100, 'min_samples_split': 10,... | 0.206666 | 0.227671 | 0.229918 | 0.207001 | 0.230785 | 0.220408 | 0.011131 | 12 |
| 20 | 1.766845 | 0.057489 | 0.014608 | 0.000184 | 25 | 50 | 5 | {'n_estimators': 25, 'min_samples_split': 50, ... | 0.170366 | 0.181747 | 0.185898 | 0.162967 | 0.186832 | 0.177562 | 0.009357 | 20 |
| 21 | 1.750723 | 0.144214 | 0.015068 | 0.000695 | 25 | 500 | 5 | {'n_estimators': 25, 'min_samples_split': 500,... | 0.157840 | 0.166581 | 0.170864 | 0.147433 | 0.170058 | 0.162555 | 0.008860 | 22 |
| 22 | 92.557526 | 0.810422 | 0.502848 | 0.038606 | 500 | 50 | 40 | {'n_estimators': 500, 'min_samples_split': 50,... | 0.233881 | 0.253940 | 0.256240 | 0.228421 | 0.256221 | 0.245741 | 0.012066 | 4 |
| 23 | 19.579130 | 0.295292 | 0.044843 | 0.005311 | 100 | 50 | 40 | {'n_estimators': 100, 'min_samples_split': 50,... | 0.232746 | 0.253298 | 0.255630 | 0.226714 | 0.255456 | 0.244769 | 0.012453 | 6 |
| 24 | 5.874663 | 0.425275 | 0.018307 | 0.000791 | 25 | 10 | 40 | {'n_estimators': 25, 'min_samples_split': 10, ... | 0.194596 | 0.214739 | 0.212304 | 0.197157 | 0.221732 | 0.208106 | 0.010485 | 15 |
| 25 | 86.690805 | 5.440012 | 0.324959 | 0.103415 | 500 | 10 | 60 | {'n_estimators': 500, 'min_samples_split': 10,... | 0.210064 | 0.230636 | 0.232775 | 0.209786 | 0.232026 | 0.223058 | 0.010745 | 10 |
| 26 | 4.038190 | 0.117740 | 0.059100 | 0.001030 | 200 | 50 | 1 | {'n_estimators': 200, 'min_samples_split': 50,... | 0.081342 | 0.100357 | 0.089587 | 0.086200 | 0.096424 | 0.090782 | 0.006856 | 29 |
| 27 | 3.560839 | 0.192950 | 0.060284 | 0.003197 | 200 | 10 | 1 | {'n_estimators': 200, 'min_samples_split': 10,... | 0.081342 | 0.100357 | 0.089587 | 0.086200 | 0.096424 | 0.090782 | 0.006856 | 29 |
| 28 | 20.349553 | 0.470240 | 0.162986 | 0.029722 | 500 | 50 | 3 | {'n_estimators': 500, 'min_samples_split': 50,... | 0.134193 | 0.148151 | 0.150006 | 0.123669 | 0.148803 | 0.140964 | 0.010390 | 25 |
| 29 | 1.687916 | 0.088348 | 0.011206 | 0.000315 | 10 | 100 | 10 | {'n_estimators': 10, 'min_samples_split': 100,... | 0.236442 | 0.241659 | 0.238712 | 0.217263 | 0.250300 | 0.236875 | 0.010875 | 8 |
plt.plot(grid_results['param_min_samples_split'],
grid_results['mean_test_score'],
'bo')
[<matplotlib.lines.Line2D at 0x7f83d0d838b0>]
plt.plot(grid_results['param_n_estimators'],
grid_results['mean_test_score'],
'bo')
[<matplotlib.lines.Line2D at 0x7f83cd2f7820>]
from sklearn.model_selection import GridSearchCV

rf = RandomForestRegressor(random_state=88, n_jobs=-1)
params = {
'max_depth': [13,15,17],
'min_samples_split': [3,5,8],
'n_estimators': [150,200,250]
}
grid_search = GridSearchCV(rf, params, cv = k, n_jobs=-1)
grid_search.fit(X,y)
GridSearchCV(cv=KFold(n_splits=5, random_state=88, shuffle=True),
estimator=RandomForestRegressor(n_jobs=-1, random_state=88),
n_jobs=-1,
param_grid={'max_depth': [13, 15, 17],
'min_samples_split': [3, 5, 8],
'n_estimators': [150, 200, 250]})
# random search over the params grid defined for the grid search above, this time with Gradient Boosting
rs = RandomizedSearchCV(estimator=GradientBoostingRegressor(random_state=42), param_distributions=params,
                        return_train_score=True, n_jobs=-1, verbose=2, cv=10, n_iter=500)
rs.fit(X_train, y_train)
/Users/ivansaucedo/opt/anaconda3/envs/aiml/lib/python3.8/site-packages/sklearn/model_selection/_search.py:285: UserWarning: The total space of parameters 27 is smaller than n_iter=500. Running 27 iterations. For exhaustive searches, use GridSearchCV.
  warnings.warn(
Fitting 10 folds for each of 27 candidates, totalling 270 fits
grid_search.best_score_
0.2544053621210665
grid_search.best_params_
{'max_depth': 13, 'min_samples_split': 8, 'n_estimators': 250}
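The notebook splits off `X_validation` early on but never scores against it. A final check of the tuned configuration on that held-out set would confirm whether the cross-validated score generalizes; a sketch on synthetic data, reusing the best parameters reported above (`max_depth=13`, `min_samples_split=8`, `n_estimators=250`):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in; in the notebook, use X_train/X_validation from the earlier split
rng = np.random.default_rng(88)
X_demo = rng.normal(size=(400, 3))
y_demo = X_demo[:, 0] * 2 + rng.normal(size=400)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=88)

# Fit with the tuned settings found by the grid search, then score on held-out data
tuned = RandomForestRegressor(max_depth=13, min_samples_split=8,
                              n_estimators=250, random_state=88)
tuned.fit(X_tr, y_tr)
val_preds = tuned.predict(X_val)
val_r2 = r2_score(y_val, val_preds)
val_rmse = mean_squared_error(y_val, val_preds) ** 0.5
```

A validation score well below the CV mean would signal that the search overfit the folds; a comparable score supports using the tuned model for the HR team's predictions.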